Prosper is a peer-to-peer lending platform that allows people to invest in each other by connecting investors and borrowers through its service.
In this Exploratory Data Analysis, I explore a Prosper Loan dataset consist of more than 100,000 loans originated between 2006 and 2014. After performing a review of variables (there are 81 variables available), I listed a set of variables that fall into the scope of my investigation.
I conducted this EDA with little to no knowledge about Prosper. Therefore, my EDA will focus on the get to know aspect of Prosper as a company (its history, borrower, and investor). I hope people who are new to peer-to-peer lending, either as a borrower or an investor, and want to know more about Prosper can benefit from this EDA.
Below is the global setting used for R Markdown in this project:
knitr::opts_chunk$set(echo = FALSE,
warning = FALSE,
message = FALSE,
fig.width = 10,
fig.height = 5)
The following R packages are also being used in this project:
library(ggthemes)
library(dplyr)
library(tidyr)
library(ggplot2)
library(grid)
library(gridExtra)
library(maps)
library(mapproj)
library(GGally)
library(scales)
Let’s take a look at the dimension of the original dataset:
dim(loan_data)
## [1] 113937 81
As we can see, the dataset consists of 113,937 observations and 81 variables. Clearly, we would not be able to utilize all these variables at the same time. Instead, after conducting a quick review of their definition, I have chosen several variables that I want to focus on exploring. Furthermore, I have also combined some variables from the original dataset to defined some new variables.
Here is the current dataset structure used in the EDA:
str(prosper)
## 'data.frame': 113081 obs. of 15 variables:
## $ risk_level : Ord.factor w/ 7 levels "HR"<"E"<"D"<"C"<..: 4 6 1 6 3 5 2 4 7 7 ...
## $ term : Ord.factor w/ 3 levels "12"<"36"<"60": 2 2 2 2 2 3 2 2 2 2 ...
## $ loan_status : Ord.factor w/ 6 levels "Current"<"FinalPaymentInProgress"<..: 4 1 4 1 1 1 1 1 1 1 ...
## $ closed_date : Date, format: "2009-08-14" NA ...
## $ borrower_rate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ listing_category : Factor w/ 21 levels "NA","Debt Consolidation",..: 1 3 1 17 3 2 2 3 8 8 ...
## $ borrower_state : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ income_range : Ord.factor w/ 8 levels "Not displayed"<..: 5 6 1 5 8 8 5 5 5 5 ...
## $ loan_original_amount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ loan_origination_date : Date, format: "2007-09-12" "2014-03-03" ...
## $ average_credit_score : num 650 690 490 810 690 ...
## $ average_investment_amount: num 36.5 10000 73.2 63.3 750 ...
## $ return_on_investment : num 0.2 0 0.39 0 0 0 0 0 0 0 ...
## $ loss_rate : num 0 0 0 0 0 0 0 0 0 0 ...
Surely, after the wrangling process, we are left with less observation from the original dataset. Currently we have 113,081 observations. Furthermore, there are 10 variables taken directly from the original dataset, and 5 others defined from a combination of variables from the original dataset. Below are a short definition of each variable names.
Since we’re dealing with several years worth of data, the first thing I’ll do is to check the time period in which these loans were originated:
## [1] "Loan Origination Period: 2006-04-21 to 2014-03-12"
Looking at these dates, I wonder why did the loan origination date start only from April 21, 2006, and end on March 12, 2014, instead of started at the beginning of 2006 and ended at the end of 2014? Out of curiosity, I decided to conduct an online search about Prosper, and as a result, I found out that the company was founded in 2005, in California, and not until February 5, 2006, did they open their marketplace to the public. Therefore, it is reasonable enough to have their first loan only originated on April 21 that year. However, I didn’t find anything particular about the upper bound date. I think, that might just be the latest data available at the time the dataset was acquired for this project.
Going forward in this analysis, we need to recognize that, when comparing data over the year, data for 2006 and 2014 are not complete.
Now let us see the annual loan origination count time series:
We can see that Prosper originated 5,239 loans in their first year in 2006, then double up the count in 2007. However, business was stagnant in 2008, and the final count was flat at 11,551 count. Suddenly, in 2009, the company went downhill, and the marketplace only able to originate 1,989 loans. It took Prosper about two years to reach an annual loan origination count higher than the 10,000 mark again. Since then, the graph suggests that Prosper have recovered, and the loan origination count has increased steadily, reaching 34,345 loans in 2013.
In 2014, although loan count appears to have declined, but we already know that we only have data for the first three months (first quarter) of 2014. If Prosper were able to originate the same number of loans as they did in the first quarter of that year, for each remaining quarters, they should be able to close the year originating around 48,000 loans; a higher number than the annual loan origination count for 2013.
However, an interesting insight from this graph actually came from the sharp decline in the 2009 annual loan origination count from the previous year number. Of course, the first possible explanation arise from the time when the incident happened, which coincided with the global financial crisis of 2007-2008. Reasonably enough, one might asked: could the crisis caused a slowing down in the marketplace? To answer this question, let us investigate the distribution further:
Notice the gap between late 2008 to mid 2009. Obviously, this is not what I expected to see as an indication of a slowing down in the marketplace. I anticipated a low loan origination count in this period, but the graph shows zero counts instead; suggesting not a slowing down, but as if Prosper had temporarily stopped doing business at all. Surely, there must be an exceptional event, beyond the global financial crisis, that was more specific to the company itself, that caused this quiet period. It is the time to do another online research again.
The online research revealed that on November 24, 2008, the U.S. Securities and Exchange Commission (SEC) imposed a cease and desist order on Prosper, and required Prosper to register its business under the Security Act. Apparently, Prosper spent months to complete the registration, during which it temporarily stopped giving new loans. Upon its comeback from the quiet period, Prosper introduced some changes to its business model, and rebranded the new era as Prosper 2.0. Subsequently, the period prior to the pre-registration is later known as Prosper 1.0.
Since we’ve unexpectedly learned a general history of Prosper just by examining the histogram of its annual loan origination count, I think the histogram deserved to be displayed as the first final plot in this EDA. I will add another aesthetic to the plot to depict additional information to the final version. Additionally, moving forward in this Exploratory Data Analysis, I will focus my analysis on comparing between Prosper 1.0 and Prosper 2.0, to see the transformation of the company as it grows over the years, and to hopefully determine which one of them performed better than the other.
Now let us see the options that borrowers have in regard to loan terms:
## prosper$loan_origination_period: Prosper 1.0
## 12 36 60
## 0 28244 0
## --------------------------------------------------------
## prosper$loan_origination_period: Prosper 2.0
## 12 36 60
## 1613 58679 24545
Apparently, the only option available for Prosper 1.0 was the 36-month loan term. The 12-month, and the 60-month loan terms only available for Prosper 2.0, though the 36-month loan term was still an option too. In fact, the 36-month loan was the most popular option in this later period (58,679 count), followed by the 60-month loan (24,545 count), and 12-month loan (1,613 count).
Unfortunately, since the 12-month and the 60-month loan terms were not available for Prosper 1.0, then I would not be able to make a comparison between these two periods for the 12-month and the 60-month loans terms. Therefore, moving forward in this EDA, I will focus my analysis only on the 36-month loans, the only type available for both Prosper 1.0 and Prosper 2.0; the two periods that I want to compare. I will revisit this loan term issue again in bivariate analysis to see the exact timeline of when these options became available.
Now let us analyse how much money the borrowers lent in each loan:
## prosper_36$loan_origination_period: Prosper 1.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2500 4500 6208 8000 25000
## --------------------------------------------------------
## prosper_36$loan_origination_period: Prosper 2.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 5500 7831 10000 35000
Both periods appear to have a positively skewed distribution on loan original amount, suggesting that most borrowers were taking smaller loans. Apparently, the minimum loan amount was $1,000 for both periods, but the maximum loan amount does seem to have increased from $25,000 in the earlier version to $35,000 some time during the latest version. As expected, the mean, as well as the median, and the first and third quartiles, of loan original amount in Prosper 2.0 are higher than they are in Prosper 1.0; suggesting borrowers in Prosper 2.0 might have taken slightly higher loan than those in earlier period. Additionally, I also notice the spikes around values, such as around: $10,000, $15,000, $20,000, and even $30,000 and $35,000 in Prosper 2.0. I assume that it is an indication of borrowers rounding up, or estimating the amount of money they need to lend, instead of requesting the exact amount to the dollar.
Now let’s see how the investors invested their money between the two periods:
## prosper_36$loan_origination_period: Prosper 1.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.15 36.36 54.38 144.70 92.00 25000.00
## --------------------------------------------------------
## prosper_36$loan_origination_period: Prosper 2.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.30 58.33 125.00 3110.00 4000.00 35000.00
Note that, for better visualization, the histogram above only shows investment amount of $10,000 or less.
Both periods have a positively skewed distribution on investment amount, suggesting that most investments were in a smaller amount, especially in Prosper 1.0. As we can see from the summary above, 75% loans in the first period have an average investment amount of $92 or less, while half loans in the second period have an average investment amount of $125 or more. Moreover, 25% loans in Prosper 2.0 have an average investment amount of $4,000 or more. Clearly, investors in Prosper 2.0 does seem to be more confident investing in larger amount per loan than those in Prosper 1.0.
Furthermore, the histogram of Prosper 2.0 shows spikes at around $4,000, and $10,000 depicting the popular higher investment amount. Not shown in the graph though, some investors in Prosper 2.0 went as high as investing $35,000 in a single loan, while the largest investment in a single loan in Prosper 1.0 was only $25,000. Remember, these are in fact, the maximum loan amount allowed in each periods.
As a background, Risk Level is a measure of risk for a borrower to go default on the loan. Obviously, there is no straightforward formula to calculate the risk, and over time, a company, such as Prosper, is bound to update its calculation method to better reflect the risk. For example, Prosper did introduce a new grading system for its risk calculation at the beginning of Prosper 2.0. They named the current rating method as Prosper Rating. Previously, they used a different approach to determine the risk, and called it Credit Grade. Since both Prosper Rating and Credit Grade measure the same risk and use the same measurement values (AA to HR), so I joined them together in one variable named Risk Level.
## prosper_36$loan_origination_period: Prosper 1.0
## HR E D C B A AA
## 3416 3219 5077 5556 4312 3243 3421
## --------------------------------------------------------
## prosper_36$loan_origination_period: Prosper 2.0
## HR E D C B A AA
## 6934 7612 9804 10175 9005 10662 4487
Now, notice how Risk Level is normally distributed during Prosper 1.0, with Grade C was having the highest count (5,556), followed by grade D (5,077 count) and B (4,312 count). However, in Prosper 2.0, grade A was the one having the highest count (10,662), which is, of course, more favorable to lenders because this group consist of highly reliable borrowers. However, Grade AA in the second period has the lowest count (4,487), and I wonder why? Could it be as a result of the newly implemented stricter grading system in this period? We will come back to this later.
Credit Score is an alternate method of measuring the risk of a borrower to go default on a loan. The term Average Credit Score that is being used in this EDA means that this value is an average number of the highest and the lowest credit scores of the borrower, taken from different credit agencies. Furthermore, since credit score measures creditworthiness of a borrower, expect this number to be one of the many factors that affect Risk Level.
So, let us review the distribution:
## prosper_36$loan_origination_period: Prosper 1.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.5 609.5 649.5 656.6 709.5 889.5
## --------------------------------------------------------
## prosper_36$loan_origination_period: Prosper 2.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 609.5 669.5 709.5 706.5 729.5 889.5
Please note that for a better display purpose, I have ignored few lower outliers in Prosper 1.0.
It’s easy to notice the obvious difference in the lower bound between the two groups as an indication of the implementation of minimum credit score requirement during Prosper 2.0. Notice that half of the loans in Prosper 1.0 were from a borrower with an average credit score of less than 650, while about 75% of loans in Prosper 2.0 were from a borrower with an average credit score of higher than 650. In fact, the minimum average credit score in this period is 609.5.
Obviously, one might asked, did the minimum credit score requirement affect the loan origination count? Well, later on, in this EDA, we will return to the issue of minimum credit score requirement, and elaborate on the reason for and the probable impact of its implementation. Also, we will explore its possible impact on investor’s confidence on lending at Prosper.
## prosper_36$loan_origination_period: Prosper 1.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1275 0.1700 0.1835 0.2368 0.3600
## --------------------------------------------------------
## prosper_36$loan_origination_period: Prosper 2.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0423 0.1274 0.1899 0.1986 0.2699 0.3600
Despite the minimum credit score requirement in Prosper 2.0, borrower’s rate seems to be slightly higher in this period than it was in Prosper 1.0. Notice that median, mean, and third quartile of borrower’s rate were higher in Prosper 2.0, although the maximums stayed the same in both periods at 0.36. We can also see from the graph, that the highest spike in Prosper 2.0 is around the 0.31 rate, while in Prosper 1.0 it is around the 0.15 value. Interestingly, the minimum borrower’s rate in Prosper 1.0 is at 0. I wonder how that’s even possible. Additionally, on the graph, notice that there are several other spikes in both graphs, such as around 0.25, 0.29, 0.35 in Prosper 1.0 and around 0.19 in Prosper 2.0 . I wonder whether these spikes indicate additional concentration points of borrower rate due to different risk levels. Unfortunately, I will not be able to confirm it through this graph only. Though simply by adding the risk level variable to the graph’s aesthetic, surely will bring a better insight into this matter.
Below are the maps showing loan origination count in each states for each periods: I would like to say that the maps above look elegant, unfortunately, it is kind of hard to recognize the order of the states by its loan count. I do notice that California was the state with the highest loan count for both periods, but I can barely say which state came second. Therefore, I decided to just list the order here:
## loan_origination_period state_full_name count
## 1 Prosper 1.0 california 3929
## 2 Prosper 1.0 georgia 1647
## 3 Prosper 1.0 illinois 1643
## 4 Prosper 1.0 florida 1302
## 5 Prosper 1.0 texas 1179
## 6 Prosper 1.0 michigan 967
## loan_origination_period state_full_name count
## 1 Prosper 2.0 california 7705
## 2 Prosper 2.0 new york 4093
## 3 Prosper 2.0 texas 4000
## 4 Prosper 2.0 florida 3834
## 5 Prosper 2.0 illinois 2966
## 6 Prosper 2.0 georgia 2300
It’s clear now that Georgia was the state with the second highest loan count in the first period, but New York took the spot in the second period. Georgia wasn’t even in the top six list in Prosper 2.0, while New York wasn’t in the top six list in Prosper 1.0. However, five other states with highest loan count in the first period all made it into the same list in later period, though not in the same order. Very interesting indeed. I wonder what cause the shift in these lists? I think this would be an interesting topic to follow through, but not the one that I would like to pursue in this EDA. Therefore, I just have to settle with the knowledge that California holds the highest loan count, which is reasonable since it is the state where the company was founded and is headquartered.
Note that: since borrower’s occupation that was listed as “other” dominates the rest occupation type in loan count, I decided to removed it from the histogram to better display the distribution of borrower’s true occupations.
As we can see, there are a large variety of borrower’s occupation. It includes waiter/waitress, truck drivers, religious, pilot, attorney, accountant/CPA, etc. Unfortunately, the vast majority of borrowers prefer to hide their occupation by selecting “other”, making it hard to get a meaningful insight about the real distribution of borrower occupation. Nonetheless, here are the top six occupations by loan count:
## loan_origination_period occupation count
## 1 Prosper 1.0 Other 7243
## 2 Prosper 1.0 Professional 3062
## 3 Prosper 1.0 1633
## 4 Prosper 1.0 Computer Programmer 1233
## 5 Prosper 1.0 Sales - Commission 1094
## 6 Prosper 1.0 Clerical 1040
## loan_origination_period occupation count
## 1 Prosper 2.0 Other 14792
## 2 Prosper 2.0 Professional 7092
## 3 Prosper 2.0 Executive 2316
## 4 Prosper 2.0 Computer Programmer 2290
## 5 Prosper 2.0 Teacher 2016
## 6 Prosper 2.0 Administrative Assistant 1972
Notice how “other” and “” (blank) are at the first and third position on the Prosper 1.0 list. The blank occupation doesn’t appear on the Prosper 2.0 list, which is an improvement and could be another sign of stricter requirements imposed in Prosper 2.0.
Moreover, notice that Professional and Computer Programmer are on both lists. Since California is the state with most loan count and is known for its technology companies, could most of these professionals and computer programmers be indeed from California? This surely is an interesting point to check. Furthermore, notice how Sales - Commission and Clerical jobs are listed as the top six on the first list, but both are missing in the second list; while Executive, Teacher, and Administrative Assistant, that are not in the first list, emerge into the second list. Could these shifts have any relation to the shift in borrower’s state that we’ve seen earlier?
Unfortunately, we can barely rely on the analysis above because of the very high proportion of occupation listed as “other” and “” that could masked the true distribution of borrower’s occupation.
## prosper_36$loan_origination_period: Prosper 1.0
## Not displayed Not employed $0 $1-24,999 $25,000-49,999
## 7017 157 576 2608 7983
## $50,000-74,999 $75,000-99,999 $100,000+
## 5384 2402 2117
## --------------------------------------------------------
## prosper_36$loan_origination_period: Prosper 2.0
## Not displayed Not employed $0 $1-24,999 $25,000-49,999
## 0 509 41 3788 17462
## $50,000-74,999 $75,000-99,999 $100,000+
## 17253 9609 10017
In Prosper 1.0, income range with the highest loan count is $25,000 - $49,999, with 7,983 count, and the second highest is $50,000 - $74,999, with 5,384 count. Both income ranges scored as the first and second in Prosper 2.0 too, with 17,462 and 17,253 count respectively. Notice that there’s only about 200 loan count different between the two ranges in the later period, but there was about 2,500 loan count different between them in earlier period. Earlier in borrower’s occupation section, we’ve seen the emergence of executive borrowers, and the loss of clerical and sales jobs. I wonder whether this shift in borrower’s occupation from lower to higher paying jobs might explain the increase in borrower’s income range. But again, we can’t say for sure right now without further research.
In Prosper 1.0, most loans did not even have a listing category posted on them. Additionally, there were only seven categories listed during that time, with Debt Consolidation posted most frequently. In Prosper 2.0, though, all loans have a listing category, and many new categories were added to the list. Debt Consolidation is still ranked first as a category with the highest loan count.
Here are six categories, ordered by loan count from high to low for Prosper 1.0 and Prosper 2.0:
## loan_origination_period listing_category count
## 1 Prosper 1.0 NA 16221
## 2 Prosper 1.0 Debt Consolidation 5067
## 3 Prosper 1.0 Personal Loan 2395
## 4 Prosper 1.0 Business 1874
## 5 Prosper 1.0 Other 1259
## 6 Prosper 1.0 Home Improvement 623
## loan_origination_period listing_category count
## 1 Prosper 2.0 Debt Consolidation 35340
## 2 Prosper 2.0 Other 7386
## 3 Prosper 2.0 Home Improvement 4773
## 4 Prosper 2.0 Business 3847
## 5 Prosper 2.0 Auto 1756
## 6 Prosper 2.0 Household Expenses 1399
Once again, we observed an indication of stricter loan application process during Prosper 2.0 through the absence of “NA” value in loan category. However, borrowers still have an option to avoid stating the purpose of their loan by selecting “others” instead.
For simplicity, note that I have regrouped loan status into four, namely: Defaulted, Past Due, Current, and Completed. Defaulted consists of defaulted and charged off loans, Past Due consists of all past due loans, Current consists of current and final payment in progress, and finally Completed consist only of completed loans.
As expected, all loans in Prosper 1.0, which should all be older than 36-month old, have been matured and either completed or defaulted. On the contrary, most loans in Prosper 2.0, which could still be less than 36-month old, were still current.
## loan_origination_period Defaulted Past Due Current Completed
## 1 Prosper 1.0 10395 0 0 17849
## 2 Prosper 2.0 4985 1373 36544 15777
As expected, there is no current loan from Prosper 1.0 because all loans from this period have matured and were closed. Interestingly, 37% loans (10,395 count) from this period ended up as defaulted loans, which is bad for the company and its investors. Fortunately, there is only 8% defaulted loans (4,985 count) for Prosper 2.0, and 2% loans were past due (1,373 count). However, since the bulk of the loans (62%, 36,544 count) were still current, the number of defaulted and past due might still increased over time. We will come back to this later.
Now is the time to look at closed loans, those that have been either completed or defaulted, and analyzed their return on investment and loss rate. Since some loans in 2011 and later years were still current, I will focus my analysis on loans prior to 2011.
The formula to calculate the Return on Investment (RoI) in this analysis is simple. It is the percentage of lender’s yield to the original loan amount. Here’s the frequency polygon of the RoI:
## prosper_36_closed$loan_origination_period: Prosper 1.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.00000 -0.32000 0.11000 -0.04186 0.23000 1.73000
## --------------------------------------------------------
## prosper_36_closed$loan_origination_period: Prosper 2.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 0.0800 0.1600 0.1431 0.3300 1.2600
Notice that the first quartile of RoI in Prosper 1.0 is negative (-32.03), meaning more than 25% loans suffer a loss. Worst still, the distribution shows highest loan count for negative RoIs occurs in the -80% to -90% range value, which means most of those who did suffer a loss, lost almost their entire investment money. Furthermore, for those who profit from Prosper 1.0, their RoIs were concentrated in the 10% - 20% range.
In Prosper 2.0, the graph suggests an improvement in loan performance, though it is hardly a definitive conclusion due to a low annual loan origination count in the early years of this period. Still, less than 25% loans have negative RoI, and the distribution of negative RoI shows no concentration of loans in a certain negative range. However, highest loan count for positive RoI still remains in the 10% - 20% range as in previous period.
Loss Rate is another mean to measure loan performance. The formula used in this EDA is a net loss on principal divided by loan original amount. Here is the histogram of Loss Rate:
## prosper_36_loss$loan_origination_period: Prosper 1.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0002273 0.5229000 0.7517000 0.6828000 0.8943000 1.0000000
## --------------------------------------------------------
## prosper_36_loss$loan_origination_period: Prosper 2.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00039 0.45980 0.70890 0.64870 0.87910 1.00000
Note that the graph only shows positive loss.
Clearly, in Prosper 1.0, we see a higher concentration of loans with higher loss rate, while in Prosper 2.0, we observe a more distributed loss rate. Notice how these graphs resemble the RoI graphs for their negative part, in the opposite order. Remember how negative RoI in Prosper 1.0 increased in loan count from the 0% to -100% range, but here, loss rate increased in loan count from 0% to 100%. Obviously, a negative RoI correspond to a positive loss rate.
Earlier we’ve seen the histogram of total annual loan count from all three terms. Since we are now only focusing on the 36-month loans, it would be a good idea to take a look at the histogram again, and see the composition of loan count based on loan term:
## term 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 1 12 0 0 0 0 12 547 782 272 0
## 2 36 5239 11425 11551 1989 5536 9651 12516 20706 8310
## 3 60 0 0 0 0 31 1030 6255 13367 3862
We’ve analysed earlier that Prosper started their business in 2006 offering the 36-month loan exclusively. We also know now that Prosper went into a quite period between late 2008 and mid 2009, and that upon it’s return, Prosper imposed a minimum credit score requirement. Now, we can see from the histogram and summary table above, that in 2009 and 2010, Prosper still far too short on its annual loan original count from the previous record, which was reached in 2008 with 11,551 count. Could this be an indication of Prosper losing borrowers due to its new minimum credit score requirement?
Nonetheless, the data suggest that the 12-month and 60-month loans options were not available until Nov 2010; indicating a possibility of an effort on behalf of the company to attract more borrowers by providing choices in loan terms. Noticeably, though annual loan count on the 36-month did not reach a new record until 2012 (12,516 count), total annual loan count in 2011 from all three terms was just shy of the 2008 record, suggesting that the availability of option might have actually worked on attracting more customers.
Since borrower’s credit score is known to impact his/her credit rating, let us see whether the minimum average credit score requirement, imposed in Prosper 2.0, has an impact in the determination of borrowers Risk Level:
What an interesting finding! We could instantly see the different patterns shown on the graphs. First thing first, notice that the graphs confirms the minimum average credit score in Prosper 2.0 is indeed higher than 600 for each Risk Levels.
Now, notice the stair steps appearance of boxplots in Prosper 1.0. Moreover, notice that their interquartile ranges do not overlap with each others. Clearly, the graph shows a nicely defined linear relationship between average credit score and risk level. As expected, a higher average credit score relates to a better risk level.
Conversely, boxplots in Prosper 2.0 suggest a weaker and less defined relationship between the two variables. Even a few borrowers with an average credit score higher than 800, were considered as a higher risk borrower in this period, when they would have been grouped as a lowest risk borrower in earlier period. In fact, since all borrowers in Prosper 2.0 have an average score higher than 600, they should have been at least in group D, if we were to follow the relationship suggested from Prosper 1.0 boxplots. Obviously, a major change in the determination of risk level was introduced after July 2009, such that: not only does it take a minimum average credit score, but it also take other factors into consideration.
Now let’s explore the borrower’s rate. First we will see its relationship with average credit score:
Despite overplotting, there seems to be a linear relationship between borrower’s rate and average credit score, noticeably in Prosper 1.0. Higher scores appear to correlate with lower rates, though high variability does seem to exist. The relationship also seem to hold for Prosper 2.0.
Now let us see the relationship between borrower’s rate and risk level:
Here we can see that in both periods, despite the high variability within each risk group, higher risk borrowers does appear to have a higher rate. Prosper 2.0 however, has a lower variability within each risk group with no overlap between each interquartile range and smaller range in general; suggesting a better defined relationship between Borrower Rate and Risk Level. We will come back to this later in the multivariate analysis for a deeper insight.
Then, let us see the relationship between Borrower Rate and Loan Original Amount:
Despite overplotting, notice that on the plot for Prosper 1.0, some smaller loans did manage to get a lower rate than bigger loans. In this case, we can say that bigger loans got higher rates. However, we do not see the smaller loans getting better rate from the Prosper 2.0 plot. On the contrary, many smaller loans have a higher rates than bigger loans in this period. I wonder whether loan original amount did actually affect the borrower’s rate. We will come back to this again in multivariate data analysis.
First, let us see the delinquency rate time series:
Notice that in general, delinquency rate has gone down over the year as shown by the linear model on the graph, which is an indication of an improvement on loan performance. Also notice how high the delinquency rates was in 2006. It only went down slightly from 0.40 to 0.39, and then to 0.33 in the next two years respectively. In 2009, when Prosper started to imposed minimum average credit score, delinquency rate dipped to 0.15, and went only slighly higher to 0.17 in the next year. In 2011, delinquency rate went up again to 0.20 before it went back down in the next two years. It is worth to mention though that there were still a bunch of active loans from 2011 foward, thus delinquency rates for these years might still go even higher. Therefore, I expect the slope of the linear model to be less than it is shown on the graph once the active loans have all matured and we got the final delinquency rates.
Now, let us see the other aspect of loan performance, the Return on Investment:
## prosper_36_closed$loan_origination_year: 2006
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.00000 -0.42000 0.09000 -0.07253 0.24000 1.19000
## --------------------------------------------------------
## prosper_36_closed$loan_origination_year: 2007
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.00000 -0.38000 0.09000 -0.06955 0.21000 1.73000
## --------------------------------------------------------
## prosper_36_closed$loan_origination_year: 2008
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000000 -0.1700000 0.1300000 -0.0009004 0.2400000 1.5900000
## --------------------------------------------------------
## prosper_36_closed$loan_origination_year: 2009
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 0.0700 0.1500 0.1362 0.3100 1.2600
## --------------------------------------------------------
## prosper_36_closed$loan_origination_year: 2010
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 0.0800 0.1600 0.1454 0.3400 0.8400
Notice that in general, mean RoI has gone up over the year, which is another indication of improving loan performance. Also notice that for three years before 2009, the annual mean RoI were below zero, and almost half ot the loans suffered negative return on investment annually. However, only less than a quarter of loans suffered the negative RoI in 2009 and 2010.
Furthermore, look at the distribution of RoI on the violin plots for each years. Notice how the proportion of the negative RoI has reduced over the year, suggesting a favorable change in loan performance.
Now, let us take a look at loss rate:
Note that only loans with loss rate higher than zero are counted in this graph.
In general, mean loss rate went only slighly down from 2006 to 2010. The distributions of loss rate as they were shown on violin plots above do not seem to change much, except that the peaks seem to occur on a lower value at a later plot and that the bottom of the plot seem to be wider in 2009.
Based on our analysis on delinquency rate, RoI, and loss rate above, we can say that loan performance does seem get better over the year. However, this conclusion is far from definitive since delinquency rates for the last three years after 2010 might still went even higher, and we don’t have the RoI and loss rate for these last three years to put into our consideration.
To get a better insight about delinquency rate, let see the trend within risk levels:
It turned out, as expected, loans a in higher risk group have a higher delinquency rate, except in a few cases such as in 2011 when the delinquency rate of HR group was slightly lower than E’s. Delinquency rates were more spread out in a wide range, covering the higher rate, in the three earlier years; but they were more cramped together in the lower and smaller range from 2009 onward.
Remember our analysis about average credit score: that since 2009, Prosper was no longer giving a loan for borrower with a credit score of less than 600, which correspond to credit score range for those in group E or HR in Prosper 1.0. Now, look at how high the delinquency rates for higher risk loans (grade E and HR) prior to 2009 were. No wonder why Prosper 2.0 started to implement the minimum average credit score in the first place. Seemingly, they hoped to get rid of higher delinquency rates in the E and HR group they had in Prosper 1.0. Surely, the maneuver seems to work, as we can see that delinquency rates have dropped and stayed well below 0.40 since 2009.
Now let us see the relationship between borrower’s rate and the risk level:
In general, borrower’s rate for a higher risk loan seems to be higher than those of lower risk loans. However, we can see higher variability and more outliers on boxplots prior to 2011. The apparent changes on boxplots from 2011 onward directed me toward another online research. I found out that Prosper used to have an auction style model to determine the interest rate of their loan, but they stopped using this model at the end of 2010, and started setting the rate based on the borrower’s risk level instead. Thus, that explains not only the high variability in rates prior to 2011, with some higher risk borrowers managed to get a lower rate while some lower risk borrowers unfortunately get a higher rate, but also the nicely defined linear relationship between Borrower Rate and Risk Level in 2011 and later, with lower risk relates to lower rate.
Now let us see a more complicated relationship between borrower’s rate, risk level, and average credit score: Note that only loans with an average credit score higher than 350 are shown in this plot.
What an interesting time series plots that we have here! These plots provide us with a valuable insight into the history of Prosper borrower’s rate. Notice how the vertical color pattern in graphs before 2009 changed into horizontal color patterns in graphs from 2011 onward. Clearly, this graph should be one of my final plot. I will elaborate more on the findings in the final plots section.
Furthermore, let us see the relationship between borrower’s rate, loan original amount, and risk level. Remember earlier in bivariate analysis, we’ve tried to plot the borrower’s rate vs loan original amount, but we didn’t gain much insight from it due to overplotting. So, let see whether adding risk level to the plot, might help:
Undoubtedly, there are many Prosper’s moments depicted in this time series. One thing in common throughout the years is that lower risk loans seemed to have a lower rate, although the rates for loans prior to 2011 do not have clear cut range between each risk levels as loans from the 2011 and after do. Obviously, this is another indication of a different rate determination method, namely auction style loans and non-auction (regular) ones.
Moreover, I also observe that prior to 2011, bigger loans do seem to have a higher rate, even for loans from the same risk level. However, that is not the case from 2011 onward. In this later periods, each risk level seemed to have a pre-determined borrower’s rate, and loan amount doesn’t seem to affect the rate.
Then, I noticed the steps appearence on the graph from 2011 onward, indicating a possible maximum loan amount allowed for each risk levels. For example, in 2013 and 2014, only borrowers in grade B or higher could get a loan higher than $25,000, and only borrowers in grade C or higher could get a loan higher than $15,000. Maximum loan amount in each risk levels has changed over the year which can clearly be seen from the graph above.
Based on our analysis on borrower’s rate above, I would say that, in a borrower’s perspective, getting a loan from Prosper prior to the quite period is easier than getting one after it. For one, in Prosper 2.0, you have to have a minimum credit score of 600 to begin with, and then, since Prosper has stopped auctioning loan rates, borrowers in higher risk groups can no longer get a lower interest rate deal. However, if you are a borrower with an excellent credit score and a lower risk level (as determined by the company), you surely will get a better deal in Prosper 2.0 than you would in Prosper 1.0, especially when need to borrower a larger amount of money.
As a start, I would like to see the relationship between return on investment (RoI) and delinquency rate by risk level:
Notice how the relationship has changed over the year. In 2006, higher risk loans appear to have lower average RoI and higher delinquency rate; but in 2010, higher risk loans appear to have higher RoI, although they still have higher delinquency rate. Obviously, Prosper has managed to introduce some changes in its loan policy to alter the trend, so that a higher risk loan relate to a higher average RoI, which of course makes it more attractive to investors despite its high delinquency rate. Of course, for those who prefer better assurance, lower risk loans are always a choice with lower delinquency rate and lower average RoI.
Now let us see return on investment by risk levels in time series:
Look at the obvious negatively high mean and median RoI for HR risk level in 2006. It was not until 2008 when at least half of loans in this level produced a positive RoI, though the mean RoI was still in the negative side. Remember that E was the risk level with second highest delinquency rate in the first three years of Prosper 1.0. Surprisingly, the mean and median RoI suggests a far better performance in this risk level comparte to the mean and median RoI in HR risk level, except in 2010, when both mean and median RoI from HR risk level are higher than those from E risk level.
Notice that after 2008 RoI on all risk levels seem to improve, and all of them finally scored a positive mean RoI. Surprisingly, lower risk levels have a better average RoI in both 2009 and 2010.
Let us now take a look at a similar graph, this one is for the loss rate:
Look at all those crazy lines! Notice how the mean loss rate of loans in lower risk groups were trending up from 2006 to 2008, but Prosper managed to bring them down in 2009 and 2010. Furthermore, mean loss rate of loans in higher risk groups, in general, seem to go down slighly over the years. Also, notice that in the 2009 and 2010 periods, the lines seem to stays closer together in two groups, lines in reds, representing higher risk loans, stay in one group on higher range, while lines in greens, representing lower risk loans, stay in another group on lower range.
Based on the graph of median loss rate above, we can say that in 2009 and 2010, half loans from each lower risk levels (C to AA) that suffered a principal loss, have a loss rate lower than 0.68 (mostly well below the 0.6 mark); but half loans from each higher risk levels (HR to D) that suffered a principal loss, have a loss rate higher than 0.74.
Now, let us see a plot between RoI and loan original amount, grouped by risk level:
Unfortunately, there’s not much we can see here, except that from 2006 to 2009 more and more lower risk loans with a higher loan original amount having a negative RoI but less so from 2009 to 2010. In general, there were less percentage of loans with negative RoI in 2009 and 2010 compared to earlier years despite the loan original amount and risk level.
Here’s a plot between loss rate and loan original amount, grouped by risk level:
Notice that the plot only shows observations with a loss rate higher than zero. Again, due to overplotting, this plot is hard to interpret. However, we can still see that most of the loans with a loss rate higher than zero are from higher risk loans, and the most lower risk loans are seen on the plot for the 2008.
After exploring delinquency rates, return on investment, and loss rate for each risk levels, I would say that, thinking as an investor, I will be more confident investing in Prosper 2.0 than I would in Prosper 1.0. For sure, I would have sticked to lower risk loans if I had to invest in earlier period, but I’m eager to diversify my investment into higher risk loans in Prosper 2.0 due to its attractively high RoI.
I chose this plot as my first final plot because it depicts a rather detailed story, not only about Prosper’s performance on originating loans, but also the performance of their loans over the year based on risk level. Obviously, the histogram on top shows annual loan origination count, and the graph on the bottom shows delinquency rates on each risk levels.
As we’ve talked about earlier, Prosper opened its marketplace to the public in early 2006, in which, the young company managed to originate a little more than 5,000 loans. Unfortunately, only time revealed that around 40% of loans granted this year went into default. In the second year of their business, Prosper successfully granted more than 11,000 loans, a 100% increase on annual loan count from the previous year. Yet, still around 40% of these loans went into default. Then, in 2008, though they managed to originate about the same number of loans as in 2007, a trouble with the U.S. Security and Exchange Commission surfaced, which lead to the SEC serving Prosper with a cease and desist order later that year, barring it from granting new loans. On top of that, still more than 30% of loans originated in 2008 went into default.
Prosper returned to business in mid 2009 after completing the required registration with the SEC. However, the company seemed to only be able to generate less than 2,500 loans in 2009 and a little more than 5,000 loans in 2010. Fortunately, they were able to keep delinquency rate well below the 25% mark in both years. Then, in 2011, Prosper still didn’t manage to grant as many loans as it did before the cease and desist order, with only less than 10,000 originated that year.
We’ve known from previous analysis about loan term, that some time in 2011, Prosper started to offer variation in loan terms, offering a 12 month and a 60 month loan, in addition to the old 36 month loan option. Looking at the low annual loan origination count since 2009, the introduction of variation in loan term could probably be a strategy to attract more borrowers. As we can see, annual loan origination count for the 36 month loan did increase in the following two years, it even surpassed the previous highest annual count set in 2008 (11,551 count). Considering additional loans originated for the 12 month and 60 month term, Prosper seemed to be growing even better in its annual total loan origination count. However, delinquency rate seemed to be creeping back up in 2011. Although it was still below the 25% mark, there were about 30% loans still current (active) by that time, posing a possibility of more defaulted loans. Similarly, it is hard to precisely predict delinquency rate for 2012 and 2013 since more than half of the loans were still active.
Looking a little closer at delinquency rates based on risk level as shown in the bottom graph, we can see, as expected, that higher risk loans have a higher delinquency rate. However, if we look at the rate between 2006 to 2008, while delinquency rates for higher risk loans were trending slightly down, the rates for lower risk loans were in fact trending up. I expect that this trend played an important part in Prosper decision to have a minimum average credit score on new loans after the 2008 - 2009 quiet period. Moreover, with the increasing delinquency rate on lower risk loans between 2006 - 2008, suggesting that even higher credit score (the primary factor that affect Credit Grade) cannot guarantee that a borrower would not go into default, Prosper decided updated their method in determining the risk level of a loan by adding more factors into consideration.
As we can see, the combination of minimum average credit score requirement and the new method in determining the risk level seemingly worked on reducing delinquency rate across risk levels (Prosper Rating). Unfortunately, delinquency rate of higher risk loans seemed to creep back up in 2011 with more than 25% loans were still active (current), though the rate was still well below 40% mark.
I chose this graph as my second final plot because it tells a story in a sequential order about changes made by Prosper in regard to borrower’s rate. On the first three graphs (2006 - 2009), we can see that each color, representing Risk Level, has a certain range on the x-axis, representing Average Credit Score, without overlapping each other range. This suggests that, during this time, risk level was determined primarily (if not solely) based on borrower’s average credit score. Then, through the vertical pattern, we can see that each color has a long range on the y-axis, representing Borrower Rate, with higher risk loan having bigger range and lower risk loan having smaller and lower range. We can clearly see this pattern as a depiction of the auction-style model on loan rate used during this period. Furthermore, the lower bound on the x-axis on these graphs suggests that Prosper might not have a minimum credit score requirement until 2008, when they started to impose a minimum average credit score of 500. Probably, in its first two years, Prosper was still trying to attract borrower to its marketplace, and thought that the less requirement it had, the more inclined a prospective borrower to try their product.
On the 2009 and 2010 graphs, the vertical pattern curiously ceased. We can clearly see that each Risk Level was no longer bound in a certain non-overlapping range, suggesting that Prosper had started using a new method on how to determine a Risk Level of a loan, which then no longer depend solely on the Average Credit Score as it previously did. Consequently, we see the vertical pattern blurred out as points in each Risk Level were no longer constrained in a certain credit score range. Additionally, lower bound on the Average Credit Score suggests an increase in minimum Average Credit Score requirement to 600, a reasonable move considering high delinquency rate on previous higher risk loans (Risk Level E and HR) as shown in final plot one.
On the graphs after 2010, a horizontal pattern appeared, suggesting another changing in Prosper business model. This time, each color seemed to be bound in a certain range on the y-axis, the Borrower Rate, indicating that Prosper has ended the auction-style on loan rate and move into a more conventional approach of a predetermined rate based on Risk Level. Finally, the most current change is shown on the last graph, where the lower bound on the x-axis suggest another increase in minimum Average Credit Score requirement to 650.
I chose this plot as the last final plot in this EDA because it offers an answer to my initial question about loan performance before and after the SEC registration period. The wide interquartile range, well into the negative value, on boxplots for 2006, 2007, and 2008 suggests more than a quarter of loans in these years suffered a negative RoI - quite a dissapointing loan performance. In fact, if we look at the first quartile plot on the top right, we can see that only loans in risk level A and AA performed better than general during this period, with less than a quarter of loans in these level suffered a negative RoI. Of course, as expected, the worst performance belongs to risk level HR, with more than half loans suffered a negative RoI, except in 2008.
Fortunately for investors, Prosper has managed to improved the performance of loans originated after the quite period. Even better, those who were risking their money on higher risk loans (risk level D, E, or HR) seems to get higher return on their investment. In fact, if we look at the median and third- quartile plot, in 2009 and 2010, we can see that half loans in risk level D, E, or HR, in each of these years, have a better RoI than three-third of loans in risk level A, or AA in the same year.
However, there’s a downside to this final plot: annual loan origination count in 2009 and 2010 were relatively low compare to annual count in earlier years. Then, the fact that delinquency rate for loans originated in 2011, where annual count were about the same as the 2007 and 2008 number, were up again after dipping to a lowest record in 2009, should make us less optimistic about the improvement suggested on the graph above.
I cannot believe that I have finally reached this part of the project. Honestly, this Exploratory Data Analysis had taken more time, required more skills, and demanded more study, than I initially anticipated. Fortunately, I can proudly say that I enjoyed every twist and turn I found during the process.
When I decided to choose the Prosper Loan Dataset, I know almost nothing about Prosper and Peer to Peer Lending platform. I know that I have to do some research to familiarize myself with the dataset, and generally set a focus for my EDA. However, I decided not to do so. Instead, I started to work on the wrangling process, and then proceed to the EDA straight away, hoping that the data will reveal something interesting for me to investigate. Interestingly, this approach works. I began to find hints from plots that lead me to interesting facts about the company, such as the quiet period between 2008 to 2009, and the fact that Prosper only started to offer 12-month and 60-month loans in 2011. Seriously, every time I found something about the company that stemmed from my analysis of the dataset, I felt triumphant and wanted to say “Ha! You can’t hide from me, I gotcha!”
Undoubtedly, there were many challenges that I had to face and overcome along the way, and overplotting was the biggest challenge of them all. It is especially true when I tried to look into the relationship between the borrower rate and the average credit score before and after the quiet period. Not until I added more variable, and faceted the graph, did I found the trend suggesting a constantly changing in the relationship between these variables over the years.
Of course, not all initial findings can actually be pursued further. For example, I was excited to find a shift on the list of states with the highest loan count and also a shift on the list of occupations with the highest loan count. I wondered, could the fact that New York had emerged as the second state with the highest loan count after the quiet period, replacing Georgia who previously held the spot, had any relation to the rise of Executive job as the third occupation with the highest loan count? Unfortunately, it is impossible to draw a conclusion about this, since the majority of loans listed borrower’s occupation as other or nothing listed at all, especially in the period before 2009, masking the true composition of borrower’s occupation.
As expected, a huge amount of trials and errors has become a part of this long journey, which mostly occurred in the visualization process as an effort to display complex relationships in a simple and easy-to-understand graphic. It is truly a humbling experience that makes me appreciate every work on data visualization even more, and from which I welcomed the idea that the ability to find relevant coding resources online goes a long way in helping me finish this project.
Definitely, there are still many variables in the dataset that I have not explored and there is also outside competition aspect that I have not put into consideration in this EDA, that might enhance or even change the analysis and findings in this report. Furthermore, additional data from current years, and updated data for newly completed or defaulted loans, surely would help to better determine loan performance after the quiet period.